Rapid Training of GIL Neural Networks

Clark Jeffries¹ and Peter Kiessler² and Louis Ntasin

¹Box 12195, IBM Corp., Research Triangle Park, NC 27709
²Dept. of Mathematical Sciences, Clemson Univ., Clemson, SC 29632.

Abstract 

Applying generalized inverse learning to a feedforward neural network has been shown to be an
effective tool in pattern recognition. The difficult computational step is finding the pseudo-inverse
of a matrix. In this paper, we develop an efficient method using differential equations to calculate
the pseudo-inverse.

1 Introduction

Applying generalized inverse learning (GIL) to a feedforward neural network has been shown to be an
effective tool in pattern recognition. Both Hanson [1] and Kaye [2] were successful in applying GIL
to scenarios using real data. The key feature of GIL is that it trains the network in a single iteration.
This suggests that there is potential for developing a pattern recognition tool that can be used in
the field. However, we first need to overcome the difficult computational step of finding the
pseudo-inverse of a matrix. In this paper, we develop efficient methods that calculate the pseudo-inverse,
that is, the Moore-Penrose generalized inverse (see [3]), via differential equations.

GIL was introduced by Pethel et al. for training feedforward neural networks. This is
the third project at Clemson University involving GIL. As mentioned above, the key step in training
the network is finding the Moore-Penrose inverse of a matrix $M$. In the first project, Kaye
[2] developed a training technique based upon the singular value decomposition of the matrix $M$.
In the second project, Hanson [1] successfully used the procedure developed by Kaye for pattern
recognition. The purpose of this project was to develop an algorithm for training the network
that was both fast and simple enough to be implemented on a chip. The algorithm we developed
satisfies these properties. Not only has the algorithm been successfully tested, but we have also provided
justification that it converges to the Moore-Penrose inverse of $M$.

The next step is to apply the algorithm to a tracking problem. While some steps have been
taken in this direction, more needs to be done. In section 6, we elaborate on future possibilities
in this direction.

The paper is organized as follows. In section 2, we present a brief discussion of the background.
Section 3 then discusses the feedforward neural network and GIL. There we elaborate on the
construction of GIL. In section 4 we discuss efficient techniques for finding the pseudo-inverse. We outline
several methods for finding the pseudo-inverse and examine the performance of each method. We
conclude the section with a discussion of the guidelines for a stopping rule for the algorithm. In
section 5 we present convergence results. Lastly, section 6 provides concluding remarks as well as
suggestions for further research.

2  Background 


Generalized Inverse Learning (GIL) is a technique recently developed for training feedforward,
hidden layer neural networks. This technique is based on the Moore-Penrose generalized inverse of a
matrix. Since its introduction by Pethel et al., GIL has proven itself as a fast and efficient tool
for pattern recognition. It has been applied to predicting time series data and acoustic data, as well as
to pattern recognition in grayscale images.

GIL  as  developed  by  Pethel  et  al.  can  be  summarized  as: 

$$G(IA)B = O \tag{1}$$

where $I$ and $O$ are the input and output matrices respectively, $A$ and $B$ are the weight matrices, and $G$ is
a nonlinear transfer function, $[G(X)]_{ij} = \frac{1}{2}[1 + \tanh(X_{ij})]$, with inverse
$[G^{-1}(X)]_{ij} = \operatorname{arctanh}(2X_{ij} - 1)$. Training is achieved by an iterative process in which the entries
in $A$ and $B$ are adjusted until the quantity

$$e = \sum_m (O_m - \Phi_m)^2$$

is minimized, where $\Phi$ is the actual (target) output.

This training technique depends on the left and right Moore-Penrose generalized inverses of
$G(IA)$. Let $M$ be an $m \times n$ matrix; then the left and right generalized inverses of $M$ are defined as

$$M_L^{-1} = (M^T M)^{-1} M^T \quad\text{and}\quad M_R^{-1} = M^T (M M^T)^{-1}$$

respectively. Let $n < m$ and assume $M$ is of rank $n$; then $M^T M$ has full rank and is thus invertible,
while $M M^T$ is singular. Thus $M_L^{-1}$ can be computed while $M_R^{-1}$ is undefined. Realizing how
restricted GIL was because of such a training technique, Kaye proposed a new training approach
using the singular value decomposition (SVD), in which the iterative scheme was reduced to the
solution of a system of linear equations $MB = O$ (where $M = G(IA)$). For any $m \times n$ matrix $M$
the singular value decomposition can be written as $M = V \Sigma W^T$, where $V$ and $W$ are orthogonal
matrices and $\Sigma$ is diagonal with the singular values of $M$. These singular values are the positive
square roots of the eigenvalues of $M^T M$. The generalized inverse (pseudo-inverse) of $M$ is therefore
$M^\dagger = W \Sigma^\dagger V^T$, where $\Sigma^\dagger$ is the transpose of $\Sigma$ with the strictly positive diagonal entries
replaced by their reciprocals. For any rectangular system $Ax = b$, the SVD gives the
best solution in the least squares sense, that is, the solution that minimizes $E = \sum_m (O_m - \Phi_m)^2$.
This change in training method led to GIL-SVD. An attractive property of GIL-SVD is its one-step
training.

Kaye [2] then applied GIL-SVD to nonlinear systems, to chaotic time series data and to periodic
functions sampled at regular intervals. Using a map that consisted of a system of nonlinear equations,
GIL-SVD attained in one pass the accuracy level that GIL took 15 iterations to attain. This
confirmed the superiority of GIL-SVD over GIL.


Using the first 25 points of the chaotic time series map

$$X_{i+1} = 4X_i(1 - X_i), \quad X_0 \in (0, 1)$$

to train, GIL-SVD was able to predict the 26th point with an average accuracy of $10^{-2}$. In fact it was
shown that the prediction of the next point on the time series map did not significantly depend on
the length of the input.
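The training data for this experiment is easy to regenerate. The following is a minimal sketch in Python of the map above; the function name `logistic_series` and the starting value 0.3 are illustrative assumptions, not taken from Kaye's experiments.

```python
def logistic_series(x0, length):
    """Return [x_0, ..., x_{length-1}] for the map x_{i+1} = 4 x_i (1 - x_i)."""
    if not 0.0 < x0 < 1.0:
        raise ValueError("x0 must lie in the open interval (0, 1)")
    series = [x0]
    for _ in range(length - 1):
        x = series[-1]
        series.append(4.0 * x * (1.0 - x))
    return series


# 25 training points plus the 26th point that the network is asked to predict.
points = logistic_series(0.3, 26)   # 0.3 is an arbitrary illustrative start
training, target = points[:25], points[25]
```

Because the map is chaotic, nearby starting values give very different series, so any accuracy figure should be averaged over several starting points.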

The last application of GIL-SVD by [2] was to periodic functions (a sine curve sampled at a
constant rate). Direct application of GIL-SVD yielded high errors. A technique called the embedding
dimension was adopted to reduce the error. An optimum embedding dimension of 3 was found and
the prediction error was of the order of $10^{-4}$.

Hanson [1] applied GIL-SVD to pattern recognition in grayscale images. To meet the
requirements of this particular application a version of GIL-SVD was defined as follows. Let an
input vector $I$ and a desired output vector $D$ be given. Construct the weight vector $A$ from selected
entries $I_{i_1}, \ldots, I_{i_n}$ of $I$ as

$$A_j = \frac{1}{2 I_{i_j}}, \quad j = 1, \ldots, n.$$

Define the transfer function $G(x)$ as

$$G(x) = \frac{2e^{kx}}{e^{kx} + e^{k(1-x)}}$$

where $k$ is the gain parameter for $G(x)$. This particular transfer function was chosen for its algebraic
structure (which we shall exploit later). The one setback of this version of GIL-SVD so far has
been the rank deficiency of the matrix $G(IA)$. Thus an exact solution for (1) using the SVD
technique cannot be guaranteed. To compensate for this effect of rank deficiency, a shift vector $C$
was introduced to absorb the difference between $D$ and the approximate output obtained. Thus for
this particular application GIL-SVD was defined as

$$G(IA)B + C = D.$$

Using this particular version Hanson showed that GIL-SVD was a perfect classifier of patterns
in grayscale images. Errors from patterns that were not of the training type showed a marked
difference when compared to errors from patterns that were of the training type. With $G(IA)$ being
rank deficient in some cases, the entries of the pseudo-inverse become very large, depending on the
degree of deficiency. This introduces the problem of "brittle fits", that is, the sensitivity of the
system becomes too high, rendering the network unreliable. Computing reliable least squares
solutions to rectangular systems has been a challenging problem. For notational purposes we shall
keep the name GIL when referring to GIL-SVD.

3  Artificial  Neural  Networks  and  GIL 

We consider a feedforward network whose architecture is shown in Figure 1 and use generalized
inverse learning (GIL) for training the network. The primary reason for using GIL is that it trains
a neural network in a single iteration. This section describes how GIL trains a neural network and
illustrates that the key computational problem is finding the inverse of a matrix. In the next section,
we discuss methods for calculating this inverse.

Figure 1: A Standard Feedforward Neural Network.

Given an input vector $I = (I_1, I_2, \ldots, I_m)^T$ and a desired output vector $D = (D_1, D_2, \ldots, D_m)^T$,
the neural network represents the function

$$D = G(IA)B + C \tag{2}$$

where $A$ and $B$ are weight vectors, $C$ is a shift vector and $G$ is the gain function, which are determined
as follows.

To determine the weight vector $A$, select $n$ representative input values $I_{i_1}, \ldots, I_{i_n}$ from among the $\{I_i\}$.
Then

$$(A_1, A_2, \ldots, A_n) = \left( \frac{1}{2I_{i_1}}, \frac{1}{2I_{i_2}}, \ldots, \frac{1}{2I_{i_n}} \right).$$

Given an $m \times n$ matrix $M$, $G(M)$ is an $m \times n$ matrix whose $(i,j)$th element is $g(M_{ij})$ where

$$g(x) = \frac{2e^{kx}}{e^{kx} + e^{k(1-x)}}.$$

Thus, the gain function $G(IA)$ is an $m \times n$ matrix which we assume has rank $n$.


The  weight  vector  B  is  found  from  the  equation 


$$D = G(IA)B.$$

In the case where $n = m$, $B = G(IA)^{-1}D$ and we take $C = 0$. When $n < m$, which is usually
the case, $D = G(IA)B$ does not have a solution. In this case $B$ is the vector that minimizes the
distance between $D$ and $G(IA)B$ and is given by

$$B = G(IA)^\dagger D,$$

where $G(IA)^\dagger$ is the pseudo-inverse of $G(IA)$. The shift vector $C$ is then found by

$$C = D - G(IA)B.$$
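The one-pass fit described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it takes $n = 2$ selected inputs so that the least squares problem can be solved with an explicit $2 \times 2$ inverse of $G^T G$ (which equals applying the pseudo-inverse when $G(IA)$ has rank 2), and the names `g`, `train`, `selected` and the sample data are hypothetical. The gain function is evaluated as $1 + \tanh(k(2x-1)/2)$, which is algebraically equal to $2e^{kx}/(e^{kx} + e^{k(1-x)})$ but does not overflow for large $k$.

```python
import math


def g(x, k):
    """Gain function g(x) = 2 e^{kx} / (e^{kx} + e^{k(1-x)}), in overflow-safe form."""
    return 1.0 + math.tanh(0.5 * k * (2.0 * x - 1.0))


def train(I, D, selected, k):
    """One-pass GIL fit for n = 2 hidden nodes; returns (A, B, C)."""
    A = [1.0 / (2.0 * I[s]) for s in selected]        # A_j = 1 / (2 I_{i_j})
    G = [[g(Ii * Aj, k) for Aj in A] for Ii in I]      # m x 2 matrix G(IA)
    # Normal equations (G^T G) B = G^T D, solved with the explicit 2 x 2 inverse.
    a = sum(r[0] * r[0] for r in G)
    b = sum(r[0] * r[1] for r in G)
    d = sum(r[1] * r[1] for r in G)
    p = sum(r[0] * Di for r, Di in zip(G, D))
    q = sum(r[1] * Di for r, Di in zip(G, D))
    det = a * d - b * b                                # nonzero when G has rank 2
    B = [(d * p - b * q) / det, (a * q - b * p) / det]
    # Shift vector C = D - G(IA) B absorbs the least squares residual.
    C = [Di - (r[0] * B[0] + r[1] * B[1]) for r, Di in zip(G, D)]
    return A, B, C


I = [1.0, 2.0, 3.0, 4.0]    # illustrative input vector (m = 4)
D = [0.1, 0.4, 0.5, 0.9]    # illustrative desired output
A, B, C = train(I, D, selected=[1, 3], k=4.0)
```

By construction the residual $C$ is orthogonal to the columns of $G(IA)$, which is a convenient check that $B$ is the least squares solution.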


It follows that the key step in the analysis is finding $G(IA)^\dagger$. We conclude this section
by discussing some of the properties of this matrix. To start, consider the function $g(x)$, which is
sigmoid and ramps from 0 to 2, taking the value 1 when $x = \frac{1}{2}$. The parameter $k$ is called the gain of
the sigmoid function and represents the slope of the sigmoid function at $x = \frac{1}{2}$.

For very large values of $k$, $g(x)$ approximates a step function that maps $x < \frac{1}{2}$ to 0, $x > \frac{1}{2}$
to 2 and $x = \frac{1}{2}$ to 1. So as $k \to \infty$, $G(IA) \to M$, an $m \times n$ matrix with entries in $\{0, 1, 2\}$,
determined up to permutation of rows. $M$ has maximal rank and a left pseudo-inverse $M^\dagger$ of $M$ can
readily be computed. $M^\dagger M$ is the $n \times n$ identity matrix, but $M M^\dagger$ is not the $m \times m$ identity
matrix $I_m$. Our main task here is to find, for any value of $k$, the matrix $M^\dagger$, otherwise known as the
pseudo-inverse, such that $M^\dagger M$ is $I_n$ and $M M^\dagger$ is as close to $I_m$ as possible. Such an $M^\dagger$ should
minimize the sum of squares of the entries of $M M^\dagger - I_m$.

Given any rectangular $m \times n$ matrix $M$, a singular value decomposition of $M$ can always
be written as $M = U \Sigma V^T$, where $U$ ($m \times m$) and $V$ ($n \times n$) are both orthogonal and $\Sigma$ is an
$m \times n$ diagonal matrix with nonnegative diagonal entries (known as the singular values of $M$)
nonincreasing down from the (1,1) position. If $M$ has maximal rank $n$, then all singular values are
positive; otherwise $M$ is rank deficient and some singular values may be zero or close to zero. The
pseudo-inverse is $M^\dagger = V \Sigma^\dagger U^T$, where $\Sigma^\dagger$ is the $n \times m$ diagonal matrix whose diagonal entries are the
inverses of the positive diagonal entries in $\Sigma$.

Constructing $A$ from the largest $n$ entries of $I$ and letting $k \to \infty$, $G(IA)$ approximates $M$
with pseudo-inverse $M^\dagger$. For this case $M^\dagger M$ is $I_n$ and $\|M M^\dagger - I_m\| = (m - n)^{1/2}$, which
is the theoretical lowest possible value.
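The value $(m - n)^{1/2}$ follows directly from the singular value decomposition. A short derivation, assuming $M$ has full rank $n$ and that the norm is the entrywise (Frobenius) norm used throughout the paper:

```latex
\begin{aligned}
M M^\dagger &= U \Sigma V^T V \Sigma^\dagger U^T
             = U \Sigma \Sigma^\dagger U^T
             = U \begin{bmatrix} I_n & 0 \\ 0 & 0 \end{bmatrix} U^T, \\
M M^\dagger - I_m &= -\,U \begin{bmatrix} 0 & 0 \\ 0 & I_{m-n} \end{bmatrix} U^T, \\
\| M M^\dagger - I_m \|^2 &= \operatorname{tr} \begin{bmatrix} 0 & 0 \\ 0 & I_{m-n} \end{bmatrix} = m - n .
\end{aligned}
```

For example, $m = 60$ and $n = 10$ give $\sqrt{50} \approx 7.07$, which is the right-hand error reported for the $60 \times 10$ matrix in Table 3.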
4 Efficient Techniques in Finding $G(IA)^\dagger$

It is always possible to satisfy (2) with $C \neq 0$. For $n = m$ and sufficiently large $k$, $G(IA)$ is square
and of full rank, thus invertible, giving a solution for (2) with $C = 0$. Our goal therefore at this point is
to find the best choices of $A$ and $B$ for any given $I$ and $D$, with the entries in $C$ having the smallest
magnitude. This choice of $A$ and $B$ is obtained by finding the pseudo-inverse of $G(IA)$.

Various techniques have been developed for computing this pseudo-inverse, the most prominent
being the singular value decomposition mentioned earlier. These techniques are not adequate for
our purpose. The ultimate objective of this work is to come up with a scheme that is accurate, fast
and stable enough for finite precision arithmetic. The speed required here should be good enough
to accommodate real time applications of GIL. A typical real time application demands that
$G(IA)$ be at least of size $150 \times 10$.

4.1 The basic idea: ODE approach.


Our technique for finding $G(IA)^\dagger$ is via differential equations. Define a hypersurface of dimension
$nm$ and consider that each choice of $A$ and $B$ yields a different position on the hypersurface,
corresponding to a particular approximation of $G(IA)^\dagger$. In 2 dimensions a graphical representation
of the problem can be summarized as follows. Let $I = (I_1, I_2, \ldots, I_m)^T$ be any input vector and
let a desired output vector be $D = (D_1, D_2, \ldots, D_m)^T$. Then the graph below shows the curve
that represents the possible input/output pairs with various choices of weight vectors $A$ and $B$.
Using the singular value decomposition we can always find the point on the curve that is closest to
the given point $(I_1, I_2, \ldots, I_m, D_1, D_2, \ldots, D_m)^T$ with respect to the Euclidean norm. The inverse
obtained from the singular value decomposition, which minimizes the distance between the curve and
the given point, is the pseudo-inverse.

Figure 2: The singular value decomposition generates a curve of possible input/output pairs with
various choices of A and B. (Axis label: Output Vector.)

Note: throughout this paper, $\|X\| = \left( \sum_{i,j} |X(i,j)|^2 \right)^{1/2}$.


So given the differential equation

$$\frac{dX}{dt} = f(X)$$

we therefore find an initial point $X(0)$ so that the above equation has a solution that converges
to $G(IA)^\dagger$. We compare two cases, one where $f$ is linear and the other where $f$ is quadratic.

In the linear case we have

$$\frac{dX}{dt} = M^T - X M M^T. \tag{3}$$

This is a first order linear ordinary differential equation in $X$. To show that solutions of this
equation converge to the pseudo-inverse consider the quantity

$$S = \sum_{i=1}^{m} \sum_{j=1}^{m} \left( \sum_{k=1}^{n} M_{ik} X_{kj} - \delta_{ij} \right)^2$$

where

$$\delta_{ij} = \begin{cases} 1 & i = j \\ 0 & i \neq j. \end{cases}$$

$S$ represents the distance of $MX$ from the identity $I_m$. The rate of change of $S$ along a trajectory can
be shown to be nonpositive. Consequently, $S$ always decreases along all non-constant trajectories.

The value of $S$ is finite at any point in the $nm$-dimensional space. Thus trajectories asymptotically
approach the pseudo-inverse. A dynamical system for finding the pseudo-inverse must keep $XM$
close to $I_n$ while changing $X$ so that $MX$ approximates $I_m$.


Various iterative schemes have been defined using this differential equation as the basis for
evolution of trajectories on the hypersurface. This differential equation is "well behaved" in that
it provides very stable trajectories that guarantee convergence as long as the initial point is on a
non-constant trajectory. The main disadvantage of this differential equation and the resulting schemes
is the convergence rate, which is linear in all the cases studied. The rate of change of any trajectory
is linear in $X$, which results in a fixed step size $\Delta t$.

The schemes developed and studied so far include

$$X(t + \Delta t) = X(t) + \{M^T - X(t) M M^T\}\Delta t \tag{4}$$

which was modified to the dual scheme

$$\begin{aligned} X(t + \Delta t) &= X(t) + \{M^T - M^T M X(t)\}\Delta t \\ X(t + \Delta t) &= X(t) + \{M^T - X(t) M M^T\}\Delta t. \end{aligned} \tag{5}$$

Also, schemes along the lines of Runge-Kutta orders 2 and 4 have been studied. Newton's method
has also been implemented, defining the function to be solved as $F(X) = I_m - MX$. This approach
showed the expected quadratic convergence rate, but lost some degree of speed due to the lack of a
variable step size.
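As an illustration of the fixed-step linear schemes, the sketch below iterates the update $X(t + \Delta t) = X(t) + \{M^T - M^T M X(t)\}\Delta t$ in plain Python. The starting point $X = 0$, the helper names and the test matrix are assumptions made for this example. For this linear iteration the step must satisfy $\Delta t < 2/\sigma_{\max}^2$, where $\sigma_{\max}$ is the largest singular value of $M$, otherwise the iterates diverge.

```python
def transpose(P):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*P)]


def matmul(P, Q):
    """Product of two matrices stored as lists of rows."""
    Qt = transpose(Q)
    return [[sum(p * q for p, q in zip(row, col)) for col in Qt] for row in P]


def euler_pinv(M, dt, steps):
    """Iterate X <- X + (M^T - M^T M X) dt from X = 0 (an assumed start)."""
    Mt = transpose(M)                                # n x m
    MtM = matmul(Mt, M)                              # n x n
    X = [[0.0] * len(M) for _ in range(len(Mt))]     # n x m zero matrix
    for _ in range(steps):
        MtMX = matmul(MtM, X)
        X = [[x + (mt - y) * dt for x, mt, y in zip(xr, mr, yr)]
             for xr, mr, yr in zip(X, Mt, MtMX)]
    return X


# Illustrative 3 x 2 matrix with singular values 2 and 1, so dt must be below 0.5.
M = [[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
X = euler_pinv(M, dt=0.2, steps=200)
```

Each singular direction contracts by the fixed factor $1 - \sigma^2 \Delta t$ per step, which is the linear convergence rate noted above.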

Due to speed limitations in the schemes derived from (3) the following quadratic differential
equation was defined:

$$\frac{dX}{dt} = X(I_m - MX). \tag{6}$$

Since this differential equation defines a continuous map on the hypersurface we can specify a
mathematical theory that leads to the difference equation

$$X(k + 1) = X(k) + X(k)\{I_m - MX(k)\}\Delta t. \tag{7}$$

This scheme has a convergence rate that is at least quadratic, thus providing the necessary speed.
Due to the selective nature of the scheme, not all starting points converge to the pseudo-inverse.
See section 5 for a proof of the optimum choice of $\Delta t$ and a guarantee of a starting point for
this particular class of problems.
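A minimal Python sketch of the quadratic update $X(k+1) = X(k) + X(k)\{I_m - MX(k)\}\Delta t$ follows. It is not the authors' step-size rule: as a simplifying assumption it fixes $\Delta t = 1$ and starts from $X(0) = M^T / \|M\|^2$, so that every eigenvalue of $MX(0)$ is at most 1 and the fixed unit step is safe. The helper names and the test matrix are likewise illustrative.

```python
def transpose(P):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*P)]


def matmul(P, Q):
    """Product of two matrices stored as lists of rows."""
    Qt = transpose(Q)
    return [[sum(p * q for p, q in zip(row, col)) for col in Qt] for row in P]


def quadratic_pinv(M, steps):
    """Iterate X <- X + X (I_m - M X) with dt = 1 from a scaled start."""
    m = len(M)
    scale = sum(v * v for row in M for v in row)              # squared norm of M
    X = [[v / scale for v in row] for row in transpose(M)]    # assumed X(0)
    for _ in range(steps):
        MX = matmul(M, X)
        # Residual R = I_m - M X.
        R = [[(1.0 if i == j else 0.0) - MX[i][j] for j in range(m)]
             for i in range(m)]
        XR = matmul(X, R)
        X = [[x + c for x, c in zip(xr, cr)] for xr, cr in zip(X, XR)]
    return X


# Same illustrative 3 x 2 matrix as before; a dozen steps are enough here.
M = [[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
X = quadratic_pinv(M, steps=12)
```

With this choice each eigenvalue $\lambda$ of $MX(k)$ in $(0, 1]$ is mapped to $1 - (1 - \lambda)^2$, so the gap to 1 is squared at every step.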


4.2  Comparison  of  the  methods 

Based on (3) above, three different schemes were developed: (4), later modified to (5), a
Runge-Kutta order 2 scheme and a Runge-Kutta order 4 scheme. Tables 1 and 2 below show typical
performance of these schemes on two different matrices. The results are based on 200 iterations for
each scheme. Each matrix is a grayscale representation of some picture, some of which are adapted
from Hanson [1].


M1 (60 x 10)          Error (Left)   Error (Right)   Largest Entry   Number of
                      ||In - XM||    ||Im - MX||     in |X|          flops (Millions)
Modified Scheme (5)   2.53           7.63            2.3             17.6
Runge-Kutta Order 2   2.53           7.63            2.3             35.6
Runge-Kutta Order 4   2.53           7.63            2.3             72

Table 1.


To provide a common measuring unit for computational complexity the number of floating
point operations (flops) was counted for each scheme. The two examples in Tables 1 and 2 show
that the cost of attaining the accuracy levels in both tables doubles as one goes from (5) to the
Runge-Kutta order 2 scheme and then to the Runge-Kutta order 4 scheme.


M2 (150 x 10)         Error (Left)   Error (Right)   Largest Entry   Number of
                      ||In - XM||    ||Im - MX||     in |X|          flops (Millions)
Modified Scheme (5)   2.03           12.70           2.2             98.3
Runge-Kutta Order 2   2.03           12.70           2.2             197.3
Runge-Kutta Order 4   2.03           12.70           2.2             396.5

Table 2.


The performance of (7) on four different matrices is shown below. Table 3 is based on 40
iterations, while in Table 4 the stopping criterion is based on the size of the entries in $X$: the
iteration stops once the absolute maximum entry exceeds 2.


Matrix Size      Error (Left)   Error (Right)   Largest Entry   Number of
                 ||In - XM||    ||Im - MX||     in |X|          flops (Millions)
M(60 x 10)       0.00           7.07            3649            1.7
M(150 x 10)      0.00           11.8            91              3.9
M(175 x 10)      0.00           12.8            4.0             4.6
M(1596 x 10)     0.00           40              .15             40.7

Table 3.


From Tables 1 to 4 it can be seen that the schemes based on (3) are computationally expensive
relative to (7), which is based on (6). Though (7) proves to be computationally less expensive, the
schemes based on (3) have an advantage that (7) lacks: global convergence.


Matrix Size      Number of    Error (Left)   Error (Right)   Largest Entry   Number of
                 Iterations   ||In - XM||    ||Im - MX||     in |X|          flops (Millions)
M(60 x 10)       11           2.00           7.34            2.54            .5
M(150 x 10)      14           1.00           11.9            2.35            1.4
M(175 x 10)      13           0.6            12.9            2.05            1.5
M(1596 x 10)     10           0.0            40              .11             10.4

Table 4.


4.3  Stopping  Criteria 

In the previous subsection, several examples have shown that the iterative scheme based on (7)
converges quickly to the pseudo-inverse. While any stopping criterion should depend on how closely
$X$ approximates the pseudo-inverse, there is another issue we need to consider. This issue is the conditioning
of $M$.

The conditioning of $M$ dramatically affects the magnitude of the elements of $X$. Recall that the
ultimate objective is to implement this scheme with finite bit arithmetic. If we cannot guarantee a
bound on the magnitude of the entries in $X(k)$, the scheme is rendered useless. The table below
shows the performance of (7) on four matrices with different conditioning. One can easily recognize
the dependence of the absolute maximum entry in $X$ on the conditioning of $M$.


Matrix Size      Largest Entry   Condition
                 in |X|          No. of M
M(60 x 10)                       469010
M(150 x 10)      91              23540
M(175 x 10)      4               1229
M(179 x 10)      .15             541

Table 5.


This particular class of problems guarantees a bound on the entries in $M$, but does not guarantee
the conditioning of $M$. There is one main characteristic of the pseudo-inverse that is of interest to
us: it is unique and minimizes $\|MX - I_m\|$. Consequently, we are guaranteed that (2) is solved as
well as possible. With poor conditioning of $M$, that is, $M$ being close to singular as is the case
with some of the $M$'s above, we cannot guarantee a bound on the entries in $X$. The
entries in $X$ become very large for $M$ close to singular, resulting in brittle fits and rendering the GIL
network unreliable. The goal is to choose $X$ so that it avoids brittle fits without compromising the
effectiveness of the GIL network.

To develop a stopping rule let $\{X_n\}$ be the sequence of matrices generated by (7) converging
to $X$, where $X$ is the pseudo-inverse of $M$. Choose $N$ large enough so that for all $n > N$, $X_n M$
is within $\epsilon$ of $I$ in norm. Recall that $B = X_n D$ and $C = D - MB$; by an appropriate choice of $X_n$ we can
control the magnitude of the entries in $B$ and consequently $C$. By the choice of the starting point
for the scheme, the sequence $\{X_n\}$ starts with terms having known bounds on the magnitude of the
entries, thereby making it possible for such a choice to exist. Recall also that $C$, the shift vector,
is introduced to take care of the error involved in using the pseudo-inverse. That makes it possible
for any left inverse, and not necessarily the pseudo-inverse, to be used. The closer $X_n$ gets to the
pseudo-inverse the better the network is for pattern recognition.


Figure 3: Evolution of four important numerical quantities in the dynamical system. (Panel titles
include: M (60-by-10), condition number = 469,010; M (150-by-10), condition number = 23,540.)


The graphs above show the evolution of the maximum entry in $|X|$, error levels and step size
for four different matrices (matrix size and condition number given at the top of each graph). These
graphs represent part of a careful experimentation with matrices up to size (175 x 10) in an attempt
to determine possible stopping criteria for (7). The graphs in Fig. 3 show that at the fifth iteration
the following observations can be made:


1.  For  each  case  the  largest  entry  in  X  has  magnitude  not  exceeding  1. 

2. In almost all of the cases the step size is very close to 1.

3. The difference between the error from the right ($\|I_m - MX\|$) and the theoretical error from
the case with $M$ constructed as $k \to \infty$ does not exceed 1.

4. The error from the left varies from case to case, generally increasing with "poor conditioning"
of $M$.

5 Convergence Results

In this section we show that our scheme converges. If $m$ and $n$ are positive integers, $M_{m,n}$ denotes
the collection of matrices having $m$ rows and $n$ columns and $M_n$ denotes the collection of square
matrices with $n$ rows and columns. For a matrix $M$, we denote the transpose of $M$ by $M^T$. The
matrix $I_m \in M_m$ denotes the identity matrix.

Let $M \in M_{m,n}$ be real where $m > n$ and assume that $M$ has rank $n$. From Theorem 7.3.5 in
Horn and Johnson [3], $M$ has a singular value decomposition of the form

$$M = V \Sigma W^T$$

where $V \in M_m$ and $W \in M_n$ are unitary and real,

$$\Sigma = \begin{bmatrix} D \\ 0 \end{bmatrix}$$

where $D \in M_n$ is a diagonal matrix whose diagonal elements consist of the strictly positive singular
values of $M$ and $0 \in M_{m-n,n}$ is a zero matrix. We note that

$$M^T = W \Sigma^T V^T.$$

The Moore-Penrose inverse for $M$ is the matrix $X \in M_{n,m}$ given by

$$X = W \Sigma^\dagger V^T \tag{8}$$

where

$$\Sigma^\dagger = \begin{bmatrix} D^{-1} & 0 \end{bmatrix}$$

and $0 \in M_{n,m-n}$ is a zero matrix.

The following procedure is developed for finding the Moore-Penrose inverse of the matrix $M$.
Set $X(0) = M^T$ and for $k = 0, 1, 2, \ldots$

$$X(k + 1) = X(k) + X(k)(I_m - MX(k))\Delta t.$$

Lemma 1 For each $k = 0, 1, 2, \ldots$, the matrix $X(k)$ has a singular value decomposition of the form

$$X(k) = W \Sigma_k^T V^T.$$

Proof. The proof is by induction. The result is true for $k = 0$ since $X(0) = M^T$ and $M^T$ has the
desired decomposition. Now assume the result is true for $k$. Then

$$\begin{aligned} X(k + 1) &= X(k) + X(k)(I_m - MX(k))\Delta t \\ &= W \Sigma_k^T V^T + W \Sigma_k^T V^T (V I_m V^T - V \Sigma W^T W \Sigma_k^T V^T)\Delta t \\ &= W [\Sigma_k^T + \Sigma_k^T (I_m - \Sigma \Sigma_k^T)\Delta t] V^T. \end{aligned}$$

Thus $X(k + 1)$ has the desired decomposition with

$$\Sigma_{k+1}^T = \Sigma_k^T + \Sigma_k^T (I_m - \Sigma \Sigma_k^T)\Delta t$$

which completes the proof.
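We write

$$\Sigma_k = \begin{bmatrix} D_k \\ 0 \end{bmatrix}$$

where $D_k \in M_n$ is a diagonal matrix and $0 \in M_{m-n,n}$ is a zero matrix. Moreover,

$$MX(k) = V \Sigma W^T W \Sigma_k^T V^T = V \Sigma \Sigma_k^T V^T$$

where $\Sigma \Sigma_k^T \in M_m$ is of the form

$$\Sigma \Sigma_k^T = \begin{bmatrix} D D_k & 0 \\ 0 & 0 \end{bmatrix}.$$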


Theorem 1 The $\Delta t$'s can be chosen so that $X(k)$ converges to the Moore-Penrose inverse $X$ of $M$.


Proof. Since the Moore-Penrose inverse is of the form (8), from Lemma 1 it suffices to show that
$D_k$ converges to $D^{-1}$. This is, however, equivalent to showing that $D D_k$ converges to $I_n$.

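Let $\lambda_1^k, \ldots, \lambda_n^k$ be the diagonal elements of $D D_k$. Since $D_0 = D$, it follows that for $k = 0$
the $\lambda$'s are the squares of the strictly positive singular values of $M$. From

$$\begin{aligned} MX(k + 1) &= V \Sigma W^T \{ W [\Sigma_k^T + \Sigma_k^T (I_m - \Sigma \Sigma_k^T)\Delta t] V^T \} \\ &= V [\Sigma \Sigma_k^T + \Sigma \Sigma_k^T (I_m - \Sigma \Sigma_k^T)\Delta t] V^T \end{aligned}$$

it follows that

$$\lambda_j^{k+1} = \lambda_j^k + \lambda_j^k (1 - \lambda_j^k)\Delta t. \tag{9}$$

For $\Delta t > 0$, the following observations can be made directly from (9):

(i) If $\lambda_j^k = 1$, then $\lambda_j^{k+1} = 1$.
(ii) If $\lambda_j^k < 1$, then $\lambda_j^{k+1} > \lambda_j^k$.
(iii) If $\lambda_j^k > 1$, then $\lambda_j^{k+1} < \lambda_j^k$.
(iv) If we choose $\Delta t = 1/\lambda_j^k$, then

$$\lambda_j^{k+1} = \lambda_j^k + \lambda_j^k (1 - \lambda_j^k)/\lambda_j^k = 1.$$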


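Now let $\lambda_{\max}^k = \max\{\lambda_j^k \mid \lambda_j^k \neq 1\}$ and choose $\Delta t = 1/\lambda_{\max}^k$. Then for every $j$ such that $\lambda_j^k > 1$
we have

$$\begin{aligned} \lambda_j^{k+1} &= \lambda_j^k + (1 - \lambda_j^k)\lambda_j^k/\lambda_{\max}^k \\ &\geq \lambda_j^k + (1 - \lambda_j^k) \\ &= 1. \end{aligned}$$

So that from (iii),

$$\lambda_j^k > \lambda_j^{k+1} \geq 1.$$

For every $j$ such that $\lambda_j^k < 1$ we have

$$\begin{aligned} \lambda_j^{k+1} &= \lambda_j^k + (1 - \lambda_j^k)\lambda_j^k/\lambda_{\max}^k \\ &\leq \lambda_j^k + (1 - \lambda_j^k) \\ &= 1. \end{aligned}$$

So from (ii),

$$\lambda_j^k < \lambda_j^{k+1} \leq 1.$$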

From (iv), our choice of $\Delta t$ guarantees that at each step the largest $\lambda_j^k$ becomes 1 at the next step.
From (i), once a $\lambda_j^k$ is 1 it stays at 1. Thus, choosing $\Delta t$ in this way, the algorithm will converge in
$n$ steps to the Moore-Penrose inverse. This completes the proof of the theorem.
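The step-size rule used in the proof can be traced on the scalar recurrence $\lambda_j^{k+1} = \lambda_j^k + \lambda_j^k(1 - \lambda_j^k)\Delta t$ alone. The sketch below is illustrative only: the singular values are made up, the function name `step` is hypothetical, and a small tolerance `tol` is an added assumption that decides when a floating point value counts as equal to 1.

```python
def step(lams, tol=1e-12):
    """Apply the recurrence once with dt = 1 / lambda_max.

    lambda_max is the largest entry of lams that is not already 1
    (up to the tolerance tol). Returns the new values and the dt used.
    """
    active = [l for l in lams if abs(l - 1.0) > tol]
    if not active:
        return list(lams), None          # every entry has reached 1
    dt = 1.0 / max(active)
    return [l + l * (1.0 - l) * dt for l in lams], dt


singular_values = [2.0, 0.5, 3.0]              # illustrative, not from the paper
lams = [s * s for s in singular_values]         # X(0) = M^T gives lambda = sigma^2
history = [lams]
for _ in range(len(lams)):                      # n steps, one per singular value
    lams, dt = step(lams)
    history.append(lams)
```

At each pass the current largest value is sent to 1 while the others move monotonically toward 1 without crossing it, so after $n$ passes every entry of `history[-1]` equals 1 up to rounding.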

Let us make a couple of remarks about the evolution of the $\Delta t$'s. The sequence of $\Delta t$'s given
by the proof is not necessarily monotonic. We note that if $\lambda_{\max}^k > 1$ then $\Delta t$ as chosen in the proof
will be less than 1 but greater than the $\Delta t$ chosen at the previous step. If, however, $\lambda_{\max}^k < 1$, then
$\Delta t$ will be bigger than 1 but it is not necessarily larger than the $\Delta t$ at the previous step.

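To construct a nondecreasing sequence of $\Delta t$'s we choose $\Delta t = 1$ whenever $\lambda_{\max}^k < 1$. Note that
in this scenario

$$\lambda_j^{k+1} = 1 - (1 - \lambda_j^k)^2. \tag{10}$$

Corollary 1 Suppose that for some $k_0$, $\lambda_{\max}^{k_0} < 1$. Let $J = \{j \mid \lambda_j^{k_0} < 1\}$. Set $\Delta t = 1$ for all
$k \geq k_0$. Then for all $j \in J$,

$$\lambda_j^k \to 1$$

as $k \to \infty$.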

Proof. Define $T : (0, 1] \to (0, 1]$ by

$$T(z) = 1 - (1 - z)^2.$$

Then

$$1 - T(z) = (1 - z)^2.$$

Pick $z_0 \in (0, 1]$ and define for $k = 1, 2, \ldots$

$$z_k = T(z_{k-1}).$$

Then an induction argument shows that

$$1 - z_k = (1 - z_0)^{2^k}.$$

Thus $z_k \to 1$ as $k \to \infty$.

Choose $j \in J$ and set $z_0 = \lambda_j^{k_0}$. Then from (10), $z_k = \lambda_j^{k_0 + k}$ and the result now follows.
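The closed form in this proof is easy to check numerically. The following small Python sketch iterates $T$ and compares $1 - z_k$ with $(1 - z_0)^{2^k}$; the starting value $z_0 = 0.05$ is an arbitrary illustrative choice.

```python
def T(z):
    """The map T(z) = 1 - (1 - z)^2 from the proof of Corollary 1."""
    return 1.0 - (1.0 - z) ** 2


z0 = 0.05                     # arbitrary illustrative starting value in (0, 1]
z = z0
gaps = []                     # tuples (k, 1 - z_k, (1 - z0)^(2^k))
for k in range(1, 8):
    z = T(z)
    gaps.append((k, 1.0 - z, (1.0 - z0) ** (2 ** k)))
```

The gap to 1 is squared at every step, so even a poor start such as $z_0 = 0.05$ is within about $10^{-3}$ of 1 after seven iterations.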

6 Conclusions and Further Research

Hanson showed that GIL is an acceptable neural network for pattern recognition. She used standard
SVD tools to find pseudo-inverses. Due to the general nature of the problem at hand and the
solution characteristics required for implementation on a chip, such techniques are not generally
acceptable. So a technique for finding the pseudo-inverse of a matrix over which we have control
became necessary, and developing it has been the major part of this work. We have in the present work
developed such a tool, one that incorporates speed and reliability in computing the pseudo-inverse of a
matrix, upon which GIL relies for training.

The convergence of the scheme as well as the speed of convergence have both been established
theoretically as well as computationally. The convergence results (section 5) not only guarantee
convergence of the numerical scheme but also tell us that if $M$ is of rank $n$, then in $n$ iterations we
converge to the pseudo-inverse using an optimal set of step sizes. Due to the ill-conditioning of $M$
and the shortcomings of finite bit arithmetic we cannot afford such step sizes. Thus we stick to the
largest step sizes that guarantee stability of the scheme while maintaining a high convergence rate.

We have also studied the evolution of some important quantities in the scheme, forming the basis
for a stopping criterion while gaining more insight into the evolution of the iterates. This insight
enables us to obtain "good" approximations of the pseudo-inverse while avoiding brittle fits. Based
on our observations from section 4, a stopping criterion should be application specific. A suitable
stopping criterion should take into account speed, accuracy and the environment under which the
GIL neural network is implemented. The choice of a stopping criterion should start with speed
requirements. The accuracy level needed should then determine whether an environment exists that
can provide the necessary accuracy without compromising the speed.

Preliminary results confirmed that the design of a microchip to implement the system of
equations in such a network could use as few as 8 bits. Although finite bit arithmetic based on 8 bits
slows down the convergence rate of the weights per iteration, the overall computational cost is
lower.

What we would like to do next is to simulate an entire network for tracking a dynamical system
using GIL for vision. To proceed with such simulations, knowledge of important parameters in
practical scenarios is necessary. For example, the kind of camera available for use in such applications
is very important. The camera speed and response time in moving it to new azimuth and elevation
angles should form a very critical part of the simulations. We also need to address the question
of the optimal window size: what window size would allow the system to keep track of the target
while keeping the computational costs at a minimum? To summarize, we would like to apply the
algorithm to a tracking problem with realistic parameters.

Given  the  present  estimates  of  speed  of  training  as  shown  both  experimentally  and  theoretically, 
this  tool  promises  an  acceptable  speed  in  some  real  time  applications.  By  optimizing  the  necessary 
steps  that  are  computationally  expensive,  we  hope  to  improve  the  speed  of  the  entire  network.  The 
simulations  will  provide  the  basis  for  estimating  the  overall  computational  costs  of  such  a  network. 
These estimates will in turn provide the necessary information for a tracking camera experiment in
which GIL will provide vision calculations. Finally, we hope to provide enough information for the
decision on whether to proceed to hardware or not. We hope to collaborate with hardware experts
during this phase of the work as important hardware decisions will have to be made.